- Study the molecular basis of variation in development and disease
- Using high-throughput experimental methods
September 21, 2015
NHGRI strategic plan
"The major bottleneck in genome sequencing is no longer data generation—the computational challenges around data analysis, display and integration are now rate limiting. New approaches and methods are required to meet these challenges."
My group's work as a simplex
What makes them different?
Much human variation is due to differences in ~6 million DNA base pairs (0.1% of the genome)
Genes are expressed differently during different stages and in different tissues.
DNA is packed, making certain parts inaccessible, and this packing is dynamic.
DNA methylation is a chemical modification of DNA, involved in gene expression regulation.
Large blocks of hypo-methylation in colon cancer
Genes with hyper-variable expression in colon cancer are enriched within these blocks.
Hypo-methylation blocks observed across five solid tumor types.
Gene expression hyper-variability enriched in hypo-methylation blocks in other cancer types.
Genes with consistent hyper-variable expression across tumors are tissue-specific.
anti-profile score: measures sample-specific deviation from normal expression in consistently hyper-variable genes
\[ \log_2 \frac{\text{std. dev}_{\text{cancer}}}{\text{std. dev}_{\text{normal}}} \]
\[ \mathrm{med} \, \text{normal expression}_g \pm 5 \times \mathrm{mad} \, \text{normal expression}_g \]
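The two formulas above can be read as a two-step recipe: rank genes by the log ratio of cancer to normal standard deviation, then score a sample by how many of the selected genes fall outside the normal range. The base R sketch below illustrates that reading on toy data. The function names, the toy matrices, and the use of R's default MAD scaling constant are assumptions made for illustration; this is not the antiProfiles package API.

```r
# Sketch only: function names and toy data are hypothetical.

# Rank genes by log2(sd in cancer / sd in normal) and keep the top n_genes.
select_hypervariable <- function(normal, cancer, n_genes) {
  sd_ratio <- log2(apply(cancer, 1, sd) / apply(normal, 1, sd))
  order(sd_ratio, decreasing = TRUE)[seq_len(n_genes)]
}

# For each sample (column of x), count the selected genes whose expression
# falls outside median +/- k * MAD of the normal samples.
anti_profile_score <- function(x, normal, genes, k = 5) {
  ref <- normal[genes, , drop = FALSE]
  center <- apply(ref, 1, median)
  spread <- apply(ref, 1, mad)  # assumption: R's default MAD scaling (1.4826)
  outside <- abs(x[genes, , drop = FALSE] - center) > k * spread
  colSums(outside)
}

# Toy data: rows are genes, columns are samples.
normal <- rbind(c(1, 2, 3, 4, 5),
                c(10, 11, 12, 13, 14),
                c(5, 6, 7, 8, 9))
cancer <- rbind(c(1, 20, 3, -15, 5),
                c(10, 11, 12, 13, 14),
                c(7, 30, 7, 7, -20))

genes <- select_hypervariable(normal, cancer, n_genes = 2)
cancer_scores <- anti_profile_score(cancer, normal, genes)
normal_scores <- anti_profile_score(normal, normal, genes)
```

In this toy example the second gene has identical spread in both groups, so it is not selected, and normal samples score zero by construction.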
Good cross-experiment properties
Stability in normal expression across experiments
Prediction in leave-one-tissue out experiment
Anti-profile score distinguishes between stages in tumor progression
DNA methylation anti-profile score distinguishes between stages in tumor progression
Stratification based on anti-profile score
Stratification of breast samples based on anti-profile score
One-class Support Vector Machines
Support Vector Machines for Anomaly Detection: determine if observations belong to a given group or are anomalies.
Distinguish observations from two anomalous groups (e.g., adenoma vs. tumor)
How can we incorporate the fact that we are classifying anomalies?
Why (and when) is it worth doing that?
Learning functions in the space spanned by representers of normal samples
\[ f(x) = \sum_i c_i k(x, z_i) + d \]
where \(z_i\) are normal observations.
Estimated, as in a regular SVM, by solving the optimization problem
\[ \min_{c,d} \sum_j (1-y_jf_j)_+ + c'\tilde{K}c \]
with \(f_j = \sum_i c_i k(x_j,z_i) + d\)
and \(\tilde{K}=K_s K_n^{-1} K_s\)
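To make the form of \(f\) concrete, the base R sketch below evaluates a decision function expanded over normal observations \(z_i\) and the hinge loss term \((1-y_jf_j)_+\). The Gaussian kernel, the coefficient values, and all function names are assumptions chosen for illustration; the slides do not specify a kernel, and the regularization term \(c'\tilde{K}c\) is not computed here.

```r
# Sketch only: kernel choice, coefficients, and function names are hypothetical.

# Gaussian (RBF) kernel between rows of X (samples to score) and rows of Z (normals).
rbf_kernel <- function(X, Z, gamma = 1) {
  sq_dist <- outer(rowSums(X^2), rowSums(Z^2), "+") - 2 * X %*% t(Z)
  exp(-gamma * pmax(sq_dist, 0))
}

# f(x) = sum_i c_i k(x, z_i) + d, with the expansion over normal observations only.
decision_function <- function(X, Z, weights, d, gamma = 1) {
  as.vector(rbf_kernel(X, Z, gamma) %*% weights + d)
}

# Hinge loss (1 - y * f)_+ for labels y in {-1, +1}.
hinge_loss <- function(y, f) pmax(0, 1 - y * f)

Z <- rbind(c(0, 0), c(2, 0))   # two normal observations
X <- rbind(c(0, 0), c(2, 0))   # two samples to score
weights <- c(1, 0)
d <- -0.5

f <- decision_function(X, Z, weights, d)
loss <- hinge_loss(c(1, -1), f)
```

Because the expansion uses only the normal observations as kernel centers, the number of coefficients depends on the number of normals, not on the number of anomalous training samples.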
Prediction of high vs. low relapse risk in lung cancer
Prediction of suspect vs. pathological fetal CTG data (not genomics)
Methylation pattern reconstruction problem
\[ \mathbb{E} y_v = \sum_{u:(v,u) \in E} \ell_{vu} \sum_{p:(v,u)\in p} \theta_p \]
\[ \min_{\theta_p} \sum_v |y_v - \sum_{u:(v,u)\in E} \ell_{vu} \sum_{p:(v,u)\in p} \theta_p | + \lambda \sum_p |\theta_p | \]
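As an illustration of how the objective is assembled, the base R sketch below evaluates it for a fixed \(\theta\) on a three-node toy graph. The graph, the path set, the edge-by-path incidence matrix representation, and the function names are assumptions made for this example; it only evaluates the objective and does not solve the minimization.

```r
# Sketch only: the toy graph, path set, and function names are hypothetical.

# edges: data frame with columns from, to, ell (one row per edge (v, u)).
# incidence: edges x paths 0/1 matrix; incidence[e, p] = 1 if path p uses edge e.
expected_signal <- function(edges, incidence, theta, n_nodes) {
  # Sum of theta_p over the paths that pass through each edge.
  edge_weight <- as.vector(incidence %*% theta)
  contrib <- edges$ell * edge_weight
  # Sum contributions over the edges leaving each node v.
  vapply(seq_len(n_nodes),
         function(v) sum(contrib[edges$from == v]),
         numeric(1))
}

# Absolute-error loss plus an L1 penalty on the path abundances.
objective <- function(y, edges, incidence, theta, lambda) {
  fitted <- expected_signal(edges, incidence, theta, length(y))
  sum(abs(y - fitted)) + lambda * sum(abs(theta))
}

edges <- data.frame(from = c(1, 1, 2), to = c(2, 3, 3), ell = c(0.5, 0.5, 1))
incidence <- cbind(p1 = c(1, 0, 1), p2 = c(0, 1, 0))
theta <- c(0.6, 0.4)
y <- c(0.5, 0.7, 0)

fitted <- expected_signal(edges, incidence, theta, n_nodes = 3)
value <- objective(y, edges, incidence, theta, lambda = 0.1)
```

Both terms are piecewise linear in \(\theta\), so the full problem can be posed as a linear program, which is presumably why an LP-capable solver is useful here.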
antiProfiles, minfi, bumphunter, Rcplex, Rcsdp, HTShape, qsmooth
Collaborative and exploratory analysis
Bsmooth, minfi)
epivizr package
Creativity in exploration
We are building software systems to support creative exploratory analysis of large genome-wide datasets…
Computed Measurements: create new measurements from integrated measurements and visualize
Summarization: summarize integrated measurements (computed on data subsets)
Statistically-guided exploration: Calculate a statistic of interest
# Get tumor methylation base-pair data
m <- assay(se)[,"tumor"]
# Compute regions with highest variability across CpGs
region_stat <- calcWindowStat(m, step=25, window=80, stat=rowSds)
s <- region_stat[,"stat"]
Explore data based on statistic
What's around the regions with the highest variability across CpGs?
# Get locations in decreasing order
o <- order(s, decreasing=TRUE)
indices <- region_stat[o, "indices"]
slideShowRegions <- rowRanges(se)[indices] + 1250000L
mgr$slideshow(slideShowRegions)
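The slides do not define `calcWindowStat`, so the base R sketch below shows one plausible reading of the windowed statistic: slide a fixed-width window along a vector of per-CpG values, compute the standard deviation in each window, and rank windows by it. The function `window_sd`, its arguments, and the toy data are hypothetical stand-ins, not the helper used in the talk.

```r
# Sketch only: window_sd is a hypothetical stand-in for the windowed statistic.

# Standard deviation of x in windows of `window` positions, advancing by `step`.
window_sd <- function(x, window, step) {
  starts <- seq(1, length(x) - window + 1, by = step)
  stat <- vapply(starts,
                 function(s) sd(x[s:(s + window - 1)]),
                 numeric(1))
  data.frame(start = starts, end = starts + window - 1, stat = stat)
}

# Toy methylation values: flat, then highly variable, then flat again.
x <- c(rep(0.5, 10), rep(c(0, 1), 5), rep(0.5, 10))

regions <- window_sd(x, window = 10, step = 10)
# Rank windows from most to least variable, as in the slideshow ordering.
ranked <- regions[order(regions$stat, decreasing = TRUE), ]
```

The first row of `ranked` is the window an analyst would visit first in a slideshow ordered by the statistic.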
Dynamically extensible: easily integrate new data types and add new visualizations.
One interpretation of Big Data: many relevant sources of contextual data
Visualization goals
metagenomeSeq, metagenomicFeatures, metaviz
Coordinates:
Hierarchically organized features
NHGRI strategic plan
"Meeting the computational challenges for genomics requires scientists with expertise in biology as well as in informatics, computer science, mathematics, statistics and/or engineering."
A new generation of investigators who are proficient in two or more of these fields must be trained and supported.
Acknowledgements
Past members of HCBravo group
now at Harvard, U. Chicago, Johns Hopkins, Genentech, Dow Jones Data Science
Colleagues at CBCB
Current members of HCBravo group
Collaborators at JHU/Harvard
Funding: NIH, Genentech, Gates Foundation
More information